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18 Abstract: Understanding the patterns and processes underlying the uneven distribution of 

19 biodiversity across space and time constitutes a major scientific challenge in evolutionary 

20 biology. With rapidly accumulating species occurrence data, there is an increasing need for 

21 making the process of coding species into operational units for biogeographic and evolutionary 

22 analyses faster, automated, transparent and reproducible. Here we present SpeciesGeoCoder, a 

23 free software package written in Python and R, that allows for easy coding of species into user- 

24 defined areas. These areas may be of any size and be purely geographical (i.e., polygons) such as 

25 political units, conservation areas, biomes, islands, biodiversity hotspots, and areas of endemism, 

26 but may also include altitudinal ranges. This flexibility allows scoring species into complex 

27 categories, such as those encountered in topographically and ecologically heterogeneous 

28 landscapes. In addition, SpeciesGeoCoder can be used to facilitate sorting and cleaning of 

29 occurrence data. The various outputs of SpeciesGeoCoder include quantitative biodiversity 

30 statistics, global and local distribution maps, and NEXUS files that can be directly used in many 

31 phylogeny -based applications for ancestral state reconstruction, investigations on biome 
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32 evolution, and diversification rate analyses. Our simulations indicate that even datasets 

33 containing hundreds of millions of records can be analysed in relatively short time using a 

34 regular desktop computer. We exemplify the use of our program through two contrasting 

35 examples: i) inferring historical dispersal of birds across the Isthmus of Panama, separating 

36 lowland vs. montane species and optimising the results onto a species-level, dated phylogeny; 

37 and ii) exploring seasonal variations in the occurrence of 10 GPS-tracked individuals of moose 

38 (Alces alces) over one year in northern Sweden. These analyses show that SpeciesGeoCoder 

39 allows an easy, flexible and fast categorisation of species distribution data for various analyses in 

40 ecology and evolution, with potential use at different spatial, taxonomic and temporal scales. 
41 

42 Key words: Biodiversity - Biogeography - Bioinformatics - Species distributions - Evolution. 
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43 Introduction 

44 Species distributions provide the basic knowledge for biodiversity research, allowing us to 

45 understand their environmental requirements, biogeographic history, and expected resilience to 

46 climate change. However, analysing the distribution of the world's estimated 8.7 million species 

47 (Mora, C, Tittensor, D.P., et al. 201 1) remains a major scientific challenge. 
48 

49 There are now over half a million species occurrences worldwide freely available through the 

50 Global Biodiversity Information Facility (GBIF; http : //www .gbif.org ) and their partners, of 

51 which the majority (about 446 million) are geo-referenced, i.e. provided with a latitude and 

52 longitude. These numbers are steadily increasing thanks to new agreements on data sharing, on- 

53 going digitalisation programmes, the work of active field biologists and amateurs combined with 

54 an increasing awareness of the need to provide rich metadata with biological collections and 

55 sequences (Hyde KD, U.D., Manamgoda DS, Tedersoo L, Nilsson RH. 2013), and tools that 

56 enable automated geo-referencing of older museum specimens (Guralnick, R.P., Wieczorek, J., 

57 et al. 2006, Garcia-Milagros, E. and Funk, V.A. 2010). Publicly available species occurrences 

58 represent an enormous data source for biodiversity research, but are as yet poorly exploited due 

59 to two main factors: i) general scepticism concerning the quality of records available, in terms of 

60 species identification and precise coordinates (Hjarding, A., Tolley, K.A., et al. 2014), and ii) 

61 demonstrated taxonomic, geographic, and temporal biases (Boakes, E.H., McGowan, P.J., et al. 

62 2010). Whereas these biases are slowly being compensated by data growth, improving quality 

63 typically relies on expert curation and targeted evaluations (e.g. http ://www .iucnredlist .org/ ) as 

64 well as tools for massive data cleaning (e.g. through workflows at the Biodiversity Virtual e- 

6 5 Laboratory , http://www.biovel.eu ) . 
66 

67 Research using species distribution data has been further hampered by a lack of tools that allow 

68 for classification of raw occurrences into discrete categories. This is often a crucial step in many 

69 evolutionary and biogeographic analyses, since such categories can then be used in connection 

70 with e.g. a phylogeny for ancestral range reconstructions (Pagel, M., Meade, A., et al. 2004, Ree, 

71 R.H. and Smith, S A. 2008, Matzke, N J. 2013) and for the estimation of area-dependent 

72 inferences of diversification rates (Goldberg, E.E., Lancaster, L.T., et al. 201 1 , Silvestro, D., 

73 Schnitzler, J., et al. 201 1 , FitzJohn, R.G. 2012). Species categorisation into commonly 
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74 recognised areas such as eco-regions and realms (Olson, D.M., Dinerstein, E., et al. 2001 , Abell, 

75 R., Thieme, M.L., et al. 2008, Holt, B.G., Lessard, J.-P., et al. 2013) may also reveal patterns of 

76 biodiversity and distribution at a large scale, facilitating the identification of regions with 

77 outstanding levels of species richness and endemism - central to the concept of Biodiversity 

78 Hotspots (Myers, N., Mittermeler, R.A., et al. 2000). 
79 

80 Rapidly increasing amounts of species occurrence data, the need to classify species into discrete 

81 areas in an automated, reproducible and transparent way tailored to evolutionary biologists, have 

82 led us to develop SpeciesGeoCoder. 
83 

84 Description 

85 SpeciesGeoCoder is written in Python ( https ://www .python.org/ ) and makes use of the R 

86 software environment ( http : //www .r-proj ect .org/ ) for statistics and plotting of results. It runs on 

87 all platforms for which these programs are available, including Windows, MacOS X, 

88 GNU/Linux and BSD, and can be run either through the command-line or through a graphical 

89 user interface (GUI). The GUI provides researchers that are not familiar with using command- 

90 line programs, an easy-to-use set of method for analysing large amounts of geo-referenced data. 

91 The source code consists of a set of python modules, and the analyses of GeoTIFF files are done 

92 using the GDAL python bindings ( http : //www . gdal .org/ ) for fast execution. The basic workflow 

93 describing the package is illustrated in Figure 1 . 
94 

95 SpeciesGeoCoder works as follows: 

96 1 . The user provides input data in two files: A) species occurrence data (including species 

97 names, latitude and longitude, in tab-delimited text format [see example files distributed with 

98 the program] or in the format directly provided by GBIF); and B) a text file defining the 

99 areas to be used for coding, i.e. a list of polygon names and coordinates for each of its edges, 

100 and (optionally) an altitudinal range (e.g., between 500 - 1000 meters above sea level). The 

101 polygons can be easily designed through various GIS tools (e.g. the freely available program 

102 QGIS, http://qgis.osgeo.org ). If an altitudinal range is provided, additional altitudinal data 

103 files for the regions covered and at the resolution desired (in GeoTIFF format) are needed. 
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104 This type of data can be freely downloaded from various on-line resources (see 

105 SpeciesGeoCoder's wiki for a list of repositories) . 

106 2. SpeciesGeoCoder loops through all samples in the input file, counting the presences for each 

107 species in each polygon. 

108 3. The default output is a NEXUS file (Maddison, D.R., Swofford, D.L., et al. 1997) containing 

109 a data matrix with all analysed species and their presence or absence in each area coded as ' 1 ' 

110 or '0', respectively. 'Presence' requires by default at least a single occurrence in an area, but 

111 may be set to require a number above a user-defined threshold (either an absolute number, a 

112 percentage of the occurrences, or both). Alternatively, the user may ask for a more complex 

113 output including the number of occurrences in each polygon, which could aid the 

114 identification of outliers that could require further verification; this information appears as 

115 comments in the NEXUS files. This file can be directly analysed in programmes that handle 

116 the NEXUS input format, such as Mesquite (Maddison, W.P. and Maddison, D.R. 2009) and 

117 most phylogenetic packages written in R, such as APE (Paradis, E., Claude, J., et al. 2004), 

118 GEIGER (Harmon, L.J., Weir, J.T., et al. 2008), and Diversitree (FitzJohn, R.G. 2012), as 

119 well as others written in Python, such as BayesRates (Silvestro, D., Schnitzler, J., et al. 201 1). 

120 4. The second (optional) result is a series of summary statistics and distribution maps. These 

121 include multiple PDF documents with bar charts as graphical representations of the number 

122 of species per area, the number of occurrences per species per area, and the relative 

123 occurrence per area for each species. The summary tables used for the graphical output are 

124 also provided as tab-delimited text files to facilitate downstream analyses. The distribution 

125 maps plot all occurrence points and the areas included in the analyses, coded at the species 

126 level. In addition, for datasets including less than 40 species, a co-existence matrix for each 

127 area is calculated and visualized as a heat plot. 

128 5. The third (optional) result is a series of plots summarising the historical movements (range 

129 expansions or dispersals) of lineages between all pairs of user-defined areas, based on a dated 

130 phylogeny (or a sample of trees, to account for topological and divergence time uncertainties). 

131 These plots are computed with novel scripts that employ available packages in R, stochastic 

132 mapping of shifts in transitions along branches, and the computation of absolute as well as 

133 relative numbers of dispersals through time, i.e. corrected by the number of lineages 

134 (Silvestro, D. 2012, Fernandez -Mendoza, F. and Printzen, C. 2013). Other methods for the 
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135 reconstruction of ancestral areas, e.g. incorporating the possibility of vicariance and founding 

136 effects (Ree, R.H. and Smith, S. A. 2008, Matzke, NJ. 2013), may be readily used based on 

137 the output of SpeciesGeoCoder but are not implemented in the first release of this package. 
138 

139 Benchmark 

140 We evaluated the performance and scalability of SpeciesGeoCoder through a series of 

141 simulations. We focused on how computing time is determined by three key variables: i) the 

142 number of geo-referenced occurrences, ii) the number of polygons, and Hi) the complexity of the 

143 polygons, measured by their number of edges (corners). The occurrence dataset (i) was simulated 

144 as a set of points evenly distributed longitudinally and latitudinally (see example in Fig. SI). The 

145 polygon dataset (ii) was generated in a similar way, by creating a grid of square polygons sharing 

146 two corners with each of its neighbouring polygons (Fig. S2). The edge dataset (iii) was 

147 generated by creating one square polygon and successively adding corners equally distributed 

148 over its perimeter (Fig. S3). The simulations were performed with a logarithmically increase in 

149 the number of occurrences, polygons, and polygon edges, ranging between 10' and 10 8 . 
150 

151 Figure 2 summarises the results of these simulations, and shows that there is a linear increase in 

152 computing time in relation to the three variables examined. These results also show that 

153 SpeciesGeoCoder can handle vast amounts of data within feasible time using regular computer 

154 hardware. 

155 

156 Biological Examples 

157 Bird biogeography. We inferred the historical dispersal of montane and lowland bird lineages 

158 through time across the Central American Seaway, which separated North and South America 

159 for millions of years until the emergence of the Isthmus of Panama. First, we downloaded the full 

160 occurrence dataset including c. 10,000 species and 200,000,000 records obtained from 

161 http : //www .ebird .org (eBird 2013). We then used SpeciesGeoCoder to identify all records found 

162 outside the American continent, as well as all those found north of Mexico, which we then 

163 excluded from further analyses. Species pertaining to the remaining records were coded as 

164 occurring in Central America, South America, or both. Rather than using political boundaries, 

165 we defined the border between these two areas following the Uramita fault (Montes, C, Bayona, 
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166 G., et al. 2012) that separates the South American and Panamanian geological plates (polygons 

167 shown in Fig. S4). We created two operational units from each polygon, one with the additional 

168 condition of only including records occurring below 1000 meters above sea level, and the other 

169 for records occurring above this altitudinal threshold, following the same categorisation as Weir 

170 (2006). We then reconstructed ancestral areas onto the species-level dated phylogeny of birds 

171 from Jetz et al. (2012), using stochastic mapping to reconstruct the historical dispersal of 

172 lineages through time among these four operational units. We calculated both the total (absolute) 

173 as well as the relative (in proportion to the number of lineages) number of dispersals between 

174 each pair of areas, using bins of 10 million years. Due to issues with synonymy, species concepts, 

175 and/or lack of geo-referenced data (issues which we could not readily solve computationally), the 

176 analyses dataset was reduced to include 4350 species. 
177 

178 Our results (Fig. 3) suggest that dispersals between the lowlands of South and Central America 

179 occurred considerably more frequently (c. 2-4 times) than dispersals between the highlands of 

180 those landmasses. There were no major differences in directionality of dispersals, except for the 

181 last time bin considered (0-10 Ma), when northwards dispersals dominated. This supports the 

182 conclusion by Weir et al. (2009) that birds mainly followed an opposite route during the Great 

183 American Biotic Interchange as compared to mammals, which migrated mostly southwards 

184 (Stehli, F.G. and Webb, S.D. 1985). The rate of dispersals increased for all categories in the most 

185 recent time bin, reflecting the full emergence of the Panama Isthmus. 
186 

187 Seasonal behaviour in Swedish moose. We also tested the functionality of SpeciesGeoCoder at a 

188 much lower taxonomic, spatial and temporal level, by analysing a dataset of ten GPS-tracked 

189 individuals of moose {Alces alces) over one year in northern Sweden (c. 80,000 records). The 

190 position data were retrieved from the Wireless Remote Animal Monitoring database system for 

191 data validation and management (Dettki, H., Ericsson, G., et al. 2013). We created five arbitrary 

192 polygons, as if they would correspond to e.g. planned national parks or areas for urban 

193 development, where knowledge on wildlife activity could play a role in decision making. We 

194 calculated the co-occurrence of individuals in each of those polygons during four seasons, 

195 correcting for season length (which followed a local meteorological classification, resulting in 

196 seasons of different numbers of days: spring 60, summer 48, autumn 71 , winter 186). 
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197 

198 The results of the moose analysis are plotted together with the original data points in Fig. 4. 

199 Moose individuals spent most of the time in the north-western polygon, despite their being 

200 concentrated to a narrow strip. Co-occurrence patterns varied among seasons, with individuals 

201 being more clustered during the spring (when calves are usually born) than in the other seasons, 

202 especially in the winter (when food is more scarce). 
203 

204 Further prospects 

205 Beyond the examples provided here, the output obtained by SpeciesGeoCoder could be readily 

206 used for e.g. calculating measures of alpha, beta and gamma diversity; the identification of 

207 neglected areas for conservation; and providing real-time detection of GPS-tagged animals 

208 entering and leaving protected areas. In addition, the visualisation and coding of species into 

209 areas may greatly facilitate cleaning up occurrence databases, by enabling the identification of 

210 outliers that may require additional examination and possible exclusion from downstream 

211 analyses. 
212 

213 Availability 

214 SpeciesGeoCoder is free software available under the GPL3 licence from 

215 https://github.com/mtop/speciesgeocoder/releases . The release and its associated Wiki page 

216 ( https://github.com/mtop/speciesgeocoder/wiki ) include installation instructions for Mac OSX, 

217 Gnu/Linux (and other UNIX- like systems) and Windows, bundled installation packages, 

218 example files, tutorials, a list of online repositories for GeoTIFF data, pre-defined polygons, and 

219 other useful links. A graphical user interface that provides the core functions of the program is 

220 available at http : //sourceforge .net/proj ect s/species geocodergui . All pages can be accessed from 

221 http://antonelli-lab .net . 
222 
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317 Figure legends 
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320 Figure 1. Simplified workflow of the SpeciesGeoCoder package. 
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Computing time depending on the number of polygons 
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322 Figure 2. Computational time in relation to the increase in the number of polygons, polygon 

323 complexity (number of edges) and number of species occurrences, as estimated through 

324 simulations. As a comparison to empirical data, the dot denoted 'Birds' corresponds the coding 

325 of all c. 200,000,000 bird occurrences available from ebird.org; while 'Moose' corresponds to 

326 the coding of c. 80000 records, as described in the text. 
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Historical events of bird dispersals between Central and South America 
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Figure 3. Relative frequency (Y-axis) of historical range movements (dispersals) of lowland and 
highland bird lineages between South America (SA) and Central America (CA), calculated per 
10-million-year time bins. The total number of dispersal events inferred for our dataset is 
indicated between brackets in the legend. 



Downloaded from http://biorxiv.org/on September 18, 2014 



333 




334 Figure 4. Spatial and seasonal variation of occurrences for 10 individuals of moose (Alces alces) 

335 in northern Sweden, scaled to the number of days in each season. The numbers on the heat scale 

336 correspond to hourly records obtained from the Wireless Remote Animal Monitoring database 

337 system and span over one year. 



